A Simple Algorithm for Minimal Search Data Index: Active Partitioned Data Search

Using PVM - Algorithm that battles with delay

N. Rajkumar¹^, Dr.S.N. Sivanandam^2,

1 Research Scholar, Dept. of Computer Science & Engineering

2 HOD, Dept. of Computer Science & engineering

PSG College of Technology

Coimbatore – 641 004, India

Email:rnalliah@Yahoo.co.in

Abstract:

In this paper, we examine the issue of mining association rules among items of a large database of search engines. The mining of association rules can be mapped into the problem of discovering large item sets where a large item set is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing an index set of itemsets first and then, identifying, within the subdivided set. Generally this is done iteratively for each large L_kitemsets are equally partitioned into S_ksubsets. To determine the related data of a searching text each index itemset contains the two candidate keys. One is moral text and another one is key that indicates the subset name where the data is originally located. To address this issue, we propose the parallel algorithm for “Minimal Search Data Index(MSDI)” based on the Active Partition Data Search (APDS). Extensive simulation study is conducted to evaluate performance of the proposed algorithm

Keywords: Parallel Algorithm, Parallel Virtual Machine, DataMining

Introduction

1.1 Parallel Virtual Machine

The system called Parallel Virtual Machine (PVM) allows a heterogeneous collection of workstations and computers to construct a parallel processing environment. PVM is portable and runs on a wide variety of modern systems. The PVM system permits a heterogeneous collection of UNIX computers networked together to be viewed by a user program as a single parallel environment with some distinctive features:

Explicit message-passing model: For a collection of computational tasks, each one performs computations.
Heterogeneity support: PVM supports heterogeneity in terms of machines, networks and applications.
Multiprocessor support: PVM uses the native message-passing facilities on multiprocessors to take advantage of the underlying hardware.
Process-based computation: The unit of parallelism in PVM is a task, which is an independent sequential thread of control that transfers between communication and computation. PVM tasks are identified by an integer task identifier. The task identifiers are supplied by the PVM system and are not user chosen, because they must be unique across the entire virtual system.
User-configured host pool: The computational tasks execute on a set of machines that are selected by the user for program execution. Both single-processor and multiprocessor computers may part of the host pool. The host pool may be altered by adding and deleting machines during operation.

A PVM computational model is shown in Figure 1.1. The dotted lines indicate integer component communication and synchronization, and the solid lines indicate the inter-instance communication and synchronization.

Input and Partitioning

Fig 1.1: PVM Computational Model

A user writes one or more sequential programs in C, C++ or Fortran 77, which are programming languages that are supported by PVM. The programs contain embedded calls to the PVM system libraries.
The programs are compiled for each architecture in the host pool, and resulting object files are placed in a location accessible to all computers in the host pool.
The user starts one copy of one task called master or an initial task by hand from a machine within the host pool. This process subsequently starts other PVM tasks called slaves, eventually resulting in a collection of active tasks that compute locally and exchange messages with each other to solve the problem.

PVM tasks may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, any task in existence may start or stop other tasks or add or delete computers from the virtual machine. Any process may communicate and/or synchronize with any other. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control-flow statements.

Owing to its ubiquitous nature (specifically, the virtual machine concept) and also because of its simple but complete programming interface, the PVM system has gained widespread acceptance in the high –performance scientific computing community.

PVM Packing Algorithm

/*database elements added into the PVM by means of n X n matrix format */

foreach transactions row_iÎSIZE do begin /*SIZE may be network capacity */

foreach transactions col_j Î SIZE do begin

a[row_i][col _j] = SIZE/20 /* divide by means of packet size */

pvm_spawn(“testdrive”,(char**)0,pvmTestDefault,”“, NPROCS,task_id);
/* Call the test drive for data distribution */

/* NPROCS can be a SIZE of the subset */

row_i ++;

col_j ++;

end

foreach rox_iÎ NPROCS do begin

/* Connection Establishment */

pvm_initsend (PvmDataDefault);

/* Data type Implementation */

pvm_pkint (&num_data,1,1);

/* Packing a specified data set */

pvm_pkint (&a[num_data*row_i] [num_data*col_j],num_data,1);

/* Ready for Data Transmission */

pvm_send (tast_id[row_i],1);

end

Receiving End

foreach row_i Î NPROCS do begin

/* Recovery of specified task */

pvm-recv (task_id[I],2);

/* unpack an item for reading */

pvm_upkint (&result [row_i],1,1);

/* Area that is used to substitute a parallel algorithm */

end

1.2 System study

When we deeply think about recent trends of information technology, Database and its mining concept plays a very important role. From the bottom level “shop” to top level research areas, there is a database everywhere.

But how to manage these huge databases is a question. For that “Distributive Rules” of data mining give the initial solution to use a candidate key instead of searching an entire database.

Based on this idea, we proposed the datamining “distributive rules” based algorithm called Minimal Search Data Index (MSDI). In that algorithm Index itemset plays a very important role.

Various algorithms have been proposed to discover the large itemsets. Generally, these algorithms first construct a candidate set of large itemsets based on some heuristics, and then discover the subset that indeed contains large itemsets.

In this paper, we propose an algorithm MSDI for time reduced data search. It is very efficient in the filed of very large database management system. Here APDS algorithm is the backbone connectivity for this MSDI algorithm.

Searching For:

Applications of Data Mining

Bypassed Link

Ref Key I

Data Mining, DSIII

Ref Key II

Ref Key III

Ref Key IV

Data Set III

(25 lakhs Data)

Eg. Data Mining

DATA INDEX

Data Set IV

(25 lakhs Data)

Data Set II

(25 lakhs Data)

Data Set I

(25 lakhs Data)

Huge Database

(More than 1Crore Data)

Search

Fig 1.2 MSDI Architecture

/*L_n= Very Large Database Variable*/

/* Set all the database index point to BOF */

foreach data datasub Î SIZE do begin

forall transaction i_pÎ n do begin

check for data occurrences

if (i_peq L_i ) then do begin

Raise a flag variable with different signals

end

data ++;

end

/*L_i= Node that is connected with flag variable */

if (flv=GREEN) then do begin

while (| L_i[ind] | != ‘ ‘ or |L_i[ind]!=’\t’)

L_q=L_i++;

process (L); /* Overall Index Value */

forall transactions L_qÎ flv do begin

if (L_q!= ‘\0’) then do begin

s[L_dats]=getc(ind[L_dats]

L_dats++

end

L_qf = File name that is red from the index page

fp=fopen(L_qf,”r”)

While((g=getc(fp))!=EOF)

Display the Result Page

Descriptive Algorithm

Case I :

Get a searching request from the user and point the records position of index page to BOF.
Searching it from first to last record of the index page. If finds (if (it==L_i))
Scan the parameter if any exists

while (| L_i[ind] | != ‘ ‘ or |L_i[ind]!=’\t’)

If exists, raise the flag as green, else raise the red flag.

Open the green flaged datasets and read it until the null value exists

if (L_q!= ‘\0’)

Open the file with read mode, display the contents simultaneously in paging format.

Case 2:

If the flag is red, then check it with the next subsets.
Repeat the same process until the full network size is reduced.
Raise a negative message if it does not exist in a network

Data set I

Pascal Mode Generation --------------------------------

Moral Page

(Data Mining)

Index Page

Moral                          Dataset

QAS                            DS4

Security                    DS2

Data Mining              DS3

CSMA/CD                 DS4

Data set IV

CSMA/CD-----------QAS -------------------------------------------

Data set III

Data Mining ---------------------------------------------------------

Fig 1.3 .Busy Establishment of Candidate and Index key that indicate suitable subset in a large database

Comparision Experiments

Values in Milli Seconds.

Subset

MSDI

APDS

Dataset Size

(Index Page Generation)

0.25

------

1,27,500 items

(Subset 2)

-----

0.26

-----Do--

(Subset 3)

------

0.252

----- Do ----

(Subset 4)

0.13

0.252

Total

0.38

0.779

Table 1.1 Comparison of APDS and MSDI in terms of execution time

Fig 1.4 Comparison Chart

Above table shows the relative performance between MSDI and APDS. Here, we use | T|=15, i.e., each transaction has 15 double items on an average, so as to have more large itemsets in later passes for interest of presentation. The execution time of these two algorithms is shown in figure. In the phase 4 of both the algorithms are the same. Phase 4 partitioned the database from very large database. But the other phases are entirely different when compared to MSDI with APDS. Every result may result in different microseconds. The tabulation with the two different database sizes. But when we go to the execution time, the time can be reduced in a large data set when compared to small data set. As mentioned before, in the MSDI algorithm has the options of switching from APDS to MSDI after early passes for better performance, and such an option is not adopted here. It can, nevertheless, be seen from figure that the execution time of the APDS is larger than the total execution time by MSDI.It can be seen from table that the execution time of the first pass of MSDI is additional step compared with APDS. However, MSDI incurs significantly smaller execution times than APDS in later passes, not only in the second pass through out the table.

Scaleup Experiments for MSDI

The increased database sized graph shows that the execution time of MSDI increases linearly as the database size increases, meaning that MSDI possesses the same important feature as APDS. Also, we examine the performance of MSDI as the number of items, N, increases. Table shows the execution times of MSDI when the number of item increases from 1, 00,000 to 1, 50,000 for these data sets 1.14 msec and 0.94 msec respectively. In other words, a large transaction has a larger likelihood of having large itemsets to process than a small transaction. Also, given a fixed minimum support, when the number of items N increases, the execution time to obtain L, increases since the size of L1 is usually close to N1 but the execution time to obtain larger k-itemsets decreases since the support for an itemset is averaged out by more items and then decreases. Consequently, as shown in table, when N increases, execution time for small transactions increase a little more prominently than those for large transactions. Performance of MSDI when the number of items increases

N	Index	Reference
1,90,000	0.12Msec	0.75 Msec
3,00,000	0.09Msec	0.32 Msec
4,50,000	0.06Msec	0.14 Msec

Conclusion

We examined in this paper, the issue of mining distributive rules among various items in a very large database of search engine transactions. The problem of handling large itemsets was solved by constructing an indexed set and then subdivided the large database into number of small scale database. We proposed the algorithm MSDI is especially effective for the identification of specific item in a very large database. We compare this algorithm with APDS algorithm and it concluded that the MSDI algorithm is better than APDS in case of data items increases. Extensive simulation study has been concluded to evaluate performance of the proposed algorithm and that simulation guides “MSDI is the algorithm that battles with delay”.

REFERENCES

1) R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM SIGMOD Int’l Conf. Management of Data, pp. 207-216, Wahington, D.C., May 1993

2) R. Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design, Implementation, and Experience,” IEEE Trans. Knowledge and Data Eng., vol.8, pp. 962-969,1996

3) R. Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int’lconf. Very Large Data Bases, pp. 487-499, Santiago, Chile, Sept. 1994.

4) R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int’l Conf. Data Eng., pp.265-276,Tucson,Ariz., May 1997.

5) S. Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket: Generalizing Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf. Management of Data, pp. 265-276, Tucson, Ariz., May 1997

6) S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.

7) M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database Perspective,” IEEETrans. Knowledge and Data Engg, Vol.8, PP.866-883, 1996.

8) D.W. Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New Orleans, Feb. 1996.

9) Y. Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 31-44, Miami Beach, Fla.,Dec. 1996.

10) D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique,” Proc. 1996 Int’l Conf, Data Engg. PP. 106-114, New Orleans, Feb. 1996.

Keywords: Parallel Algorithm, Parallel Virtual Machine, DataMining

Introduction

PVM Packing Algorithm

Receiving End

Data Set III

DATA INDEX

Data Set IV

Data Set II

Data Set I

Huge Database

Descriptive Algorithm

Data set I

Moral Page

Index Page

QAS DS4

Data set IV

Data set III

Total

Table 1.1 Comparison of APDS and MSDI in terms of execution time

Scaleup Experiments for MSDI

N

Index

Reference

Conclusion

REFERENCES